Abstract

In this work, we suggest a parameterized statistical model (the
gamma distribution) for the frequency of word occurrences in long
strings of English text and use this model to build a corresponding
thermodynamic picture by constructing the partition function. We
then use our partition function to compute thermodynamic quantities
such as the free energy and the specific heat. In this approach, the
parameters of the word frequency model vary from word to word, so
that each word has a different corresponding thermodynamics, and we
suggest that differences in the specific heat reflect differences in
how the words are used in language, differentiating keywords from
common and function words. Finally, we apply our thermodynamic
picture to the problem of retrieval of texts based on keywords and
suggest some advantages over traditional information retrieval
methods.

1 Introduction

Let us imagine that we are looking for some article on the Web.
Probably the first thing we will do is to go to a search engine and
type some keywords. If we type a query like "I am looking for an
article about statistical mechanics of images", although it is
exactly what we want, we will probably get nothing related to the
subject, or only content partially related to it. In order to get
meaningful results, one needs to refine the query to something like
"image", "statistical mechanics", ignoring in this way the structure
of the language and relying on statistical estimates of the parts of
the query that best carry its meaning.

Current web search engines are the product of some 15 years of
evolution. This evolution has shown that if we are looking for the
meaning of a text, we must look for specific, statistically salient
keywords that are supposed to be present in it, largely ignoring the
syntactic and semantic structure of the language.

Probably the best way to analyze a text written in some language [1]
would be to have an exact description of the language, for example a
weighted context-free grammar [2]. However, given Zipf's law [3] for
the frequency distribution of words, even if a reasonable grammar
exists, a single text of arbitrary length will contain some 40%
halomorphemes [4]. As a consequence, the length of the grammar will
be of the order of the length of the text for any text we choose.

Therefore, it is convenient to consider the language 
as a set of all the texts spoken/written in that lan- 
guage. Using statistical arguments, we do not need 
all texts, but only a significantly large random set of
texts in order to treat the problem.

In this article we propose a statistical physics model of the text
that treats the text as a large random data set. The text is regarded
as conditioned on the language in which it is written and can be
restricted to the area to which it belongs, for example "nonlinear
physics" or "novels of the 17th century".

The model we investigate consists of a text T and a 
vocabulary V, written in some language. The vocab- 
ulary is formed using as a basis some huge collection 
of texts, written in that language. 

The relationship between the vocabulary and the text is asymmetric.
In an article on nonlinear science, it is highly probable to find
terms like "chaotic dynamics" or "Hamiltonian", but highly improbable
to find words like "horses" and "knights". For "Don Quixote", for
example, it is just the opposite. So a text that treats some subject
is highly restricted by this subject, and the latter conditions the
vocabulary used. The language as a whole has no such restriction.
Therefore, the relative excess (or higher frequency) of a word in the
vocabulary is a normal situation.

On the contrary, the relative excess of a word in the text has a
specific meaning: if the word occurs much more often in the text than
in the common language, this can be interpreted as an indication that
the text treats exactly the subject expressed by this word, i.e. that
the word is a specific term or keyword in the text. This is the first
class of words in the text that we will consider in this article.

On the other hand, the text will always contain words that are common
in the language and have more or less the same frequency in any text
and in the vocabulary. A large fraction of the words of this type is
formed by the so-called function words. These words carry no meaning
by themselves but are essential for expressing the language
structure. A typical example of a function word in English is the
word "the". The problem with this category is that it is not easy to
define it in a way that can be implemented by a computer program. A
similar and strictly defined category is the class of closed-class
words, which by definition are the words that do not change their
form in any text.



Finally the third class of words that will follow 
more or less the same frequency distribution in the 
text and in the vocabulary are the common words. 
They serve to transmit the meaning of the text, but 
are common for every text that must explain some 
concept, like for example the word "explain" in this 
sentence. In this class significant deviations between 
different texts and different authors can be expected. 

In the literature, the statistical treatment of text is mainly
considered in relation to information retrieval (IR) theory, where it
has proved very fruitful [5, 6].

Another statistical approach is centered on the Zipf law [3, 7] and
looks at the relative distribution of different words (types) in a
collection of texts. The Zipf law can be derived from the requirement
of maximal information exchange [8]. This approach mainly focuses on
the tail of the distribution, which is an example of a large number
of rare events (LNRE).

In this article we fix the length of the text to some reasonable
value (10000 words) and consider it in relation to some dictionary.
Because the number of words under consideration is fixed, we do not
have to deal with the LNRE type of distribution.

The main contributions of this article are: 

• The gamma distribution is a better model of
word occurrences than other models considered
in the literature.

• The specific heats of different words reflect im- 
portant differences in how words are used in lan- 
guage. 

• The thermodynamic picture offers advantages 
when searching for relevant texts based on a set 
of keywords. 

The paper is organized as follows. In Section 2 we define the model
and the approximations used. In Section 3 we derive an expression for
the frequency of a given word in a fixed-length text and the
potential energy corresponding to this probability distribution in
the thermodynamic limit. In Section 4 we derive an analytical
expression for the free energy of a text and the corresponding
thermodynamic quantities. Using the results from Section 4, in
Section 5 we calculate these thermodynamic quantities numerically for
a set of arbitrarily selected texts and find that the specific terms
(keywords) and the rest of the text have different thermodynamic
behavior. Section 6 sketches a toy application to keyword-based
retrieval. Section 7 presents our discussion and comments on future
directions of the work. Section 8 briefly summarizes research related
to the present work, and Section 9 presents the conclusions of the
article.

2 The Model 

In our approach we use the following metaphor to explain the model.
We consider the vocabulary as a solid-state basement, composed of
"molecules", which form the parts of the text. The text itself is
considered as a liquid solution of "molecules", derived in the same
manner as the vocabulary. The text and the vocabulary "react", and
there is some energy gain when the reaction takes place, so some
"molecules" settle on the solid base.

As a first approximation, the molecules can be assumed to react only
if they represent one and the same word in the text and the
vocabulary. A typical text has insignificant length compared to the
vocabulary, and practically all the words of the text will "deposit",
except for orthographic errors, words defined in the text and
probably foreign proper names. To make the treatment of the text
almost independent of its length, we can require an equal total
number of "molecules" in the solid and the liquid phase. This can be
achieved by replicating the text as many times as necessary for the
text and the vocabulary to have the same length.

Our model thus consists of a vocabulary of length $L_v$, a text of
length $L_t$, and the "molecules" (words) $w$ of the text that match
the "molecules" of the vocabulary. The corresponding numbers of
occurrences of these "molecules" are $n_t(w)$ and $n_v(w)$ for the
text and for the vocabulary, respectively. In order to fulfill the
requirement of equal length between the text and the vocabulary, we
can introduce some standard text length $L_0$ and normalize the
number of occurrences of $w$ according to this length:

$$N_t(w) = L_0 \frac{n_t(w)}{L_t}, \qquad N_v(w) = L_0 \frac{n_v(w)}{L_v}.$$

For convenience we choose $L_0 = L_t$ in the numerical experiments.
We denote the number of deposited molecules, normalized to length
$L_0$, by $m(w)$. This parameter will be used below as an order
parameter for the system.
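
The normalization above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `normalized_counts` and the toy word lists are hypothetical, and the default $L_0 = L_t$ follows the choice made in the numerical experiments.

```python
from collections import Counter

def normalized_counts(text_words, vocab_words, L0=None):
    """Return {w: (N_t, N_v)} for words present in both text and vocabulary."""
    L_t, L_v = len(text_words), len(vocab_words)
    if L0 is None:
        L0 = L_t  # choice used in the numerical experiments: L_0 = L_t
    n_t, n_v = Counter(text_words), Counter(vocab_words)
    # N_t(w) = L_0 * n_t(w) / L_t,  N_v(w) = L_0 * n_v(w) / L_v
    return {w: (L0 * n_t[w] / L_t, L0 * n_v[w] / L_v)
            for w in n_t if w in n_v}

# Toy data (hypothetical): a 6-word "text" and a 10-word "vocabulary".
text = "the cat sat on the mat".split()
vocab = "the a of the cat dog the on in the".split()
counts = normalized_counts(text, vocab)
```

Words of the text that do not occur in the vocabulary (here "sat" and "mat") are simply left out, in line with the assumption that only matching "molecules" react.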

The problem of regarding the text as a thermodynamic system consists
of defining the "molecules" $w$ and the energy of the interaction
$E(w) = E(m(w), N_t(w), N_v(w), L_0)$ between the language and the
text. In this article we will regard as "molecules" the usual English
words, consisting of continuous strings of letters, separated by
non-letter symbols in written texts. In the rest of the article we
will not distinguish between "molecules" and words. As a first
approximation we assume that the words are independent, i.e. that
there is no interaction between different words. Due to this assumed
independence, the extensive thermodynamic quantities, for example the
free energy, will be the sum of the corresponding quantities over the
words. Therefore, we can build a theory based on a single word and
extrapolate it to the text.
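
The definition of a word as a continuous string of letters translates directly into a tokenizer. The sketch below is a minimal illustration; the restriction to ASCII letters and the lowercasing are our assumptions and are not specified in the text.

```python
import re

def tokenize(raw):
    """Split raw text into words: maximal runs of letters, lowercased."""
    # Assumption: only ASCII letters count as letters; case is ignored.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", raw)]

words = tokenize("The cat's hat, re-used!")
```

With this definition, apostrophes and hyphens act as separators, so "cat's" yields two "molecules".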

Further, we consider that the language (the solid compound) imposes
some potential energy field with strength dependent on $N_v$ and
$L_v$ but not on the text, i.e. not on $N_t$ (where it is not
required we will omit the argument $w$). We also assume that the
system is in thermal equilibrium.

According to this consideration, the probability $P(m)$ of the state
with $m$ deposited molecules is [11]:

$$P(m) \propto G(m, N_t) \exp(-\beta E(m, N_t, N_v, L_0)), \qquad (1)$$

where $E(m, N_t, N_v, L_0)$ is the energy of settling $m$ molecules,
$G(m, N_t)$ is the degeneracy of these states and $\beta$ is the
inverse temperature, $\beta = 1/T$. The degeneracy is just the number
of ways we can select $m$ "molecules" out of a set of $N_t$
molecules, i.e. $G(m, N_t) = \binom{N_t}{m}$. Note that this number
is strictly zero if $m > N_t$, which reflects the fact that we have
only $N_t$ molecules.
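
As an illustration of Eq. (1), the following sketch normalizes $G(m, N_t)\exp(-\beta E(m))$ over $m = 0, \dots, N_t$. It assumes an integer $N_t$ and takes the energy as an arbitrary user-supplied function; the linear energy in the example is purely hypothetical and is not the energy derived later in the paper.

```python
import math

def log_binom(n, m):
    """Logarithm of the binomial coefficient C(n, m) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

def deposition_probabilities(N_t, beta, energy):
    """P(m), m = 0..N_t, proportional to C(N_t, m) * exp(-beta * E(m))."""
    logs = [log_binom(N_t, m) - beta * energy(m) for m in range(N_t + 1)]
    top = max(logs)  # subtract the maximum to avoid overflow
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical energy: one unit of energy gained per deposited molecule.
probs = deposition_probabilities(N_t=10, beta=0.0, energy=lambda m: -m)
```

At $\beta = 0$ the energy plays no role and $P(m)$ reduces to the pure combinatorial weight $\binom{N_t}{m}/2^{N_t}$.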

Regarding that system, one can impose the requirement that its
properties scale with the length of the texts, i.e. if we
simultaneously scale the size of the vocabulary and the text by $s$,
the thermodynamic potential will scale in the following way:

$$E(sm, sN_t, sN_v, sL_0) = s E(m, N_t, N_v, L_0)$$

and

$$\log(G(sm, sN_t)) = s \log(G(m, N_t)).$$

These requirements must be fulfilled only in the asymptotic limit,
i.e. when $s \to \infty$, which permits the use of the saddle-point
approximation in Eq. 1, considering as important only the limit

$$\lim_{s\to\infty} \frac{1}{s}\left[\log G(sm, sN_t) - \beta E(sm, sN_t, sN_v, sL_0)\right].$$

3 Frequency of a single word 

Let us consider the frequency of occurrence of a single word $w$ in a
text of length $L$, regarding just the case where the word occurs
$x \gg 1$ times. The question is what the probability distribution of
$x$ is for a given word in such a segment of text. The answer can be
given only by empirical argument, by investigating a large repository
of texts.

The usual hypothesis is that the distribution is Binomial or a
mixture of Binomials that corresponds to some urn process [7]. More
sophisticated models suppose that the distribution is a mixture of a
Binomial (when the word is not used as a keyword) and a Flat
distribution (when the word is used as a keyword) [12, 13, 14]. Some
process is assumed to be responsible for this distribution, in which
the probability of having the word in a text increases if the word is
already in the text. This leads to a mixture of Poisson processes.

However, we have found that the distribution is far from Binomial. As
an illustration, in Fig. 1 we give the frequency distribution of the
word "the" in the Gutenberg collection [15] of texts, with
$L = 10000$. It is practically impossible for this word to be used as
a keyword, and therefore one would expect the distribution to be
simply Binomial. However, it is clear that the distribution is not
Binomial; it is highly skewed and far away from the Binomial
distribution with that frequency [16].

Figure 1: (Color online) Frequency distribution of the word "the" in
10000 consecutive words of the corpus. The dots represent the
empirical data, the red line the best Poisson/Normal distribution fit
and the blue line the best Gamma fit. The black line is the Binomial
distribution that corresponds to the empirical parameters.



Empirically we have found that the distribution is a Gamma
distribution for all the words, provided that the different meanings
of homonyms are regarded as different words.

By definition, the Gamma distribution is:

$$P(x; w) = \frac{e^{-bx} x^{a-1} b^{a}}{\Gamma(a)}, \qquad (2)$$

where $b$ is a parameter independent of the length of the text, i.e.
it depends only on the word and the class of text we are regarding.
The parameter $a$ is proportional to the length of the text $L$.

The empirical proof of the statement about the Gamma distribution can
be performed on a text corpus of sufficiently large size, dividing it
into small fragments. These segments must be chosen with a sufficient
length $L$ in order to have $L p_w \gg 1$, where $p_w$ is the
probability of occurrence of the word $w$.

We have checked the above hypothesis of a Gamma distribution on the
British National Corpus (BNC) [17] and on a set of about 19000
English texts chosen from the Gutenberg collection, and we found an
excellent agreement (p > 0.8) with the experimental data for all the
words with $p_w > 5/10000$ [16].
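
A minimal sketch of such an empirical check is given below. It cuts a tokenized corpus into consecutive segments of $L$ words, counts the occurrences of one word per segment and estimates the Gamma parameters. The estimator is the method of moments, which is our choice for illustration only; the text does not state which fitting procedure was used. The parameterization is the rate form, with mean $a/b$ and variance $a/b^2$.

```python
def segment_counts(words, target, L):
    """Occurrences of `target` in consecutive non-overlapping segments of L words."""
    return [words[i:i + L].count(target)
            for i in range(0, len(words) - L + 1, L)]

def fit_gamma_moments(counts):
    """Moment estimates (a, b) for P(x) = b^a x^(a-1) exp(-b x) / Gamma(a)."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)  # sample variance
    b = mean / var   # rate parameter
    a = mean * b     # shape parameter
    return a, b

# Toy data (hypothetical counts of one word in four segments).
a_hat, b_hat = fit_gamma_moments([2, 4, 6, 8])
toy_counts = segment_counts("a b a c a b".split(), "a", 2)
```

A goodness-of-fit test on the binned counts would be the natural next step and is not shown here.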

The statement that the distribution of a given word is Gamma is not
common in the literature. In this article we do not give a model to
explain it. However, independently of the nature of the underlying
process, we found that the Gamma distribution fits the empirical data
well.

Figure 2: Potential energy of a word as a function of the number of
occurrences. It consists of two parts: a logarithmically falling
part, for values of the argument from zero to the mean frequency of
the word, and a linearly increasing part, predominant in the range
where the frequency of the word is larger than its mean frequency in
the language.

Further, we have analyzed the asymptotic behavior of the
distribution. To achieve this, we replicated the text $s$ times and
considered the limit
$\lim_{s\to\infty} [\log P(sx; w | sa, b)]/s = a - bx - a\log a + a\log x + a\log b$.
Using that the mean of $x$ is $\bar{x} = a/b$, we obtained for the
asymptotic behavior of $\log P(x)$ the following final expression:

$$E_p(x; w) = -\log P(x) = -\bar{x} b \left[1 - \frac{x}{\bar{x}} + \log\left(\frac{x}{\bar{x}}\right)\right]. \qquad (3)$$

$E_p$ can be regarded as a potential energy of the word $w$ in the
language. The logarithmic term corresponds to the entropic part of
the energy [18], while the linear one accounts for the excess of
words of a given type in the text. A normalized energy curve is given
in Fig. 2.
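
The potential energy can be evaluated directly. The sketch below assumes the reading of Eq. (3) as $E_p = \bar{x} b\,[x/\bar{x} - 1 - \log(x/\bar{x})]$, which is what the asymptotic limit above gives when $a = b\bar{x}$; the function name is hypothetical.

```python
import math

def potential_energy(x, x_mean, b):
    """E_p = b * x_mean * (x/x_mean - 1 - log(x/x_mean)); zero at x = x_mean."""
    r = x / x_mean
    return b * x_mean * (r - 1.0 - math.log(r))

# Consistency check against the limit a - b x - a log a + a log x + a log b,
# with hypothetical values b = 0.5, x_mean = 4 (so a = 2) and x = 3.
b, x_mean, x = 0.5, 4.0, 3.0
a = b * x_mean
limit = a - b * x - a * math.log(a) + a * math.log(x) + a * math.log(b)
```

The energy vanishes at the mean frequency, falls logarithmically towards it from below and grows almost linearly above it, which is the shape described in the caption of Fig. 2.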



4 The free energy 

Using the above considerations, the corresponding partition function
for a given word $w$ is:

$$Z(w, \beta) = \sum_{m=1}^{N_t} G(m, N_t) \exp(-\beta E_p(m, N_t)), \qquad (4)$$

where we have used the argument that the energy for a single word is
given by its potential energy, Eq. (3).

Inserting into the above equation the expression for the degeneracy,
$G(m, N_t) = \binom{N_t}{m}$, and identifying the parameters
$\bar{x} = N_v$ and $x = m$, we arrive at the following expression
for the partition function:

$$Z(w, \beta) = \sum_{m=1}^{N_t} \exp(-\beta E_{tot}(m, N_t)). \qquad (5)$$

Here

$$E_{tot}(m, N_t) = -\frac{1}{\beta} \log \binom{N_t}{m} - N_v b \left[1 - \frac{m}{N_v} + \log\left(\frac{m}{N_v}\right)\right] \qquad (6)$$

is the total energy corresponding to some word $w$, and we have moved
the degeneracy factor inside the exponent.

As can be seen, the total energy for one word is composed of a
potential part $E_p$ and a combinatorial part
$-\frac{1}{\beta}\log G(m, N_t)$.
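
The partition function of a single word can be evaluated numerically by direct summation. The sketch below is an illustration under assumptions: $N_t$ is taken to be an integer, the potential part is written as $N_v b\,[m/N_v - 1 - \log(m/N_v)]$, and the free energy is taken as the standard $F = -\frac{1}{\beta}\log Z$. The sum is carried out in the log domain to avoid overflow.

```python
import math

def e_tot(m, N_t, N_v, b, beta):
    """Total energy: potential part minus (1/beta) * log C(N_t, m)."""
    log_g = math.lgamma(N_t + 1) - math.lgamma(m + 1) - math.lgamma(N_t - m + 1)
    r = m / N_v
    e_p = b * N_v * (r - 1.0 - math.log(r))
    return e_p - log_g / beta

def log_partition(N_t, N_v, b, beta):
    """log Z = log sum_{m=1}^{N_t} exp(-beta * E_tot(m)), via log-sum-exp."""
    terms = [-beta * e_tot(m, N_t, N_v, b, beta) for m in range(1, N_t + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def free_energy(N_t, N_v, b, beta):
    """Single-word free energy, assuming the standard F = -(1/beta) log Z."""
    return -log_partition(N_t, N_v, b, beta) / beta

log_z_no_potential = log_partition(10, 5.0, 0.0, 2.0)
```

With $b = 0$ the potential vanishes and $Z$ reduces to $\sum_{m=1}^{N_t}\binom{N_t}{m} = 2^{N_t} - 1$, which provides a simple check of the summation.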

Finally, the full free energy of the text is a sum over all the words
of the text:

$$F = \sum_{w} F(w) = -\frac{1}{\beta} \sum_{w} \log Z(w, \beta). \qquad (7)$$

The equation for the order parameter $m$ can be obtained by using the
saddle-point method and the Stirling approximation,
$\log N! \approx N\log N - N$, $N \gg 1$:

$$\frac{\partial F}{\partial m} = \frac{1}{\beta} \log\frac{m}{N_t - m} - b\,\frac{N_v - m}{m} = 0. \qquad (8)$$

This equation can be solved in closed form, giving the following
final expression:

$$\frac{m}{N_t} = \frac{b\beta N_v/N_t}{b\beta N_v/N_t + W\left(b\beta N_v/N_t \; e^{b\beta - b\beta N_v/N_t}\right)}, \qquad (9)$$
where $W(\cdot)$ is the Lambert W function [19]. The ratio $m/N_t$ is
a monotonically increasing function of $\beta$ and $N_v/N_t$.

For small values of $N_v/N_t$, the ratio $m/N_t$ is small for any
temperature, growing later above some critical value of $N_v/N_t$.
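
The closed-form solution for $m/N_t$ is easy to evaluate. The sketch below assumes the saddle-point condition in the form $\log\frac{m}{N_t - m} = \beta b\,\frac{N_v - m}{m}$ and computes the Lambert W function from the logarithm of its argument, by Newton iterations on $w + \log w = y$, so that large values of $b\beta$ do not overflow. The helper names are hypothetical.

```python
import math

def lambert_w_from_log(y, max_iter=100, tol=1e-14):
    """Return w > 0 solving w + log(w) = y, i.e. w = W(exp(y))."""
    w = y if y > 1.0 else math.exp(y)  # starting guess
    for _ in range(max_iter):
        step = (w + math.log(w) - y) / (1.0 + 1.0 / w)  # Newton step
        w -= step
        if abs(step) <= tol * w:
            break
    return w

def deposited_fraction(N_t, N_v, b, beta):
    """m / N_t from the Lambert W solution of the saddle-point equation."""
    c_nu = b * beta * N_v / N_t
    y = math.log(c_nu) + b * beta - c_nu  # log of the argument of W
    return c_nu / (c_nu + lambert_w_from_log(y))

mu = deposited_fraction(10.0, 5.0, 1.0, 2.0)
```

A quick way to validate the result is to substitute $m/N_t$ back into the saddle-point condition and check that the residual vanishes.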

Further we can consider the rest of the thermodynamic quantities. The
entropy $S$ for a single word is:

$$S = -\frac{\partial F}{\partial T} = N_t\log N_t - m\log m - (N_t - m)\log(N_t - m). \qquad (10)$$


Note that Eq. (10) asymptotically approaches the "usual" entropy
$-m\log m$ for $N_t \to \infty$. However, when $m$ is of order $N_t$,
this is no longer true. Substituting Eq. (9) into Eq. (10), we obtain
an explicit expression for the entropy as a function of the system's
parameters. It is a monotonically decreasing function of $\beta$ and
$N_v/N_t$.
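
The entropy expression is the Stirling approximation of $\log\binom{N_t}{m}$, which can be checked numerically. The sketch below compares the two for hypothetical values of $N_t$ and $m$; it assumes $0 < m < N_t$.

```python
import math

def entropy(m, N_t):
    """S = N_t log N_t - m log m - (N_t - m) log(N_t - m), for 0 < m < N_t."""
    return (N_t * math.log(N_t) - m * math.log(m)
            - (N_t - m) * math.log(N_t - m))

def log_binom(n, m):
    """Exact log C(n, m) via log-gamma, for comparison."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

# Hypothetical sizes: the relative difference shrinks as N_t grows.
rel_err = abs(entropy(3000, 10000) - log_binom(10000, 3000)) / log_binom(10000, 3000)
```

The difference between the two expressions grows only logarithmically with $N_t$, so it becomes negligible in the limit of long (replicated) texts.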

The second derivative of the free energy is related to the "specific
heat" (Fig. 3):

$$C_V = -T \frac{\partial^2 F}{\partial T^2}.$$

In the context of the statistical model of texts, this quantity can
be interpreted in the following way: if $C_V$ for a given word is
high, then replacing this word by another one, or omitting it, will
introduce a relatively big distortion in the text, leading to a
significant change of the total energy. On the other hand, replacing
a word with negligible $C_V$ will have no relevant consequence for
the text.
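
A single-word specific heat can be estimated numerically from the partition function. The sketch below is only an illustration: it assumes the standard definitions $F = -T\log Z$ and $C_V = -T\,\partial^2 F/\partial T^2$, an integer $N_t$, and the potential energy in the form $N_v b\,[m/N_v - 1 - \log(m/N_v)]$; the second derivative is taken by central differences with a hypothetical step `h`.

```python
import math

def log_z(T, N_t, N_v, b):
    """log of Z = sum_{m=1}^{N_t} C(N_t, m) * exp(-E_p(m) / T)."""
    terms = []
    for m in range(1, N_t + 1):
        log_g = (math.lgamma(N_t + 1) - math.lgamma(m + 1)
                 - math.lgamma(N_t - m + 1))
        r = m / N_v
        terms.append(log_g - b * N_v * (r - 1.0 - math.log(r)) / T)
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def specific_heat(T, N_t, N_v, b, h=1e-3):
    """C_V = -T d^2F/dT^2 with F = -T log Z, by central differences."""
    def F(t):
        return -t * log_z(t, N_t, N_v, b)
    return -T * (F(T + h) - 2.0 * F(T) + F(T - h)) / (h * h)

c_v = specific_heat(1.0, 10, 5.0, 1.0)
```

Because the degeneracy does not depend on the temperature, the same quantity equals the variance of $E_p$ divided by $T^2$, which gives an independent check of the finite-difference estimate.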

We use the usual notation $C_V$, adopted in thermodynamics for
isochoric processes, where the volume of the system is fixed,
although what is fixed here is the number of occurrences of a given
word. We also show a section of Fig. 3 for $b = 1$, $N_v = 5$ in the
lower panel of the figure.

As can be seen, $C_V$ starts from zero at $T = 0$, then
exhibits a maximum at some temperature $T$, after
which it decreases again to zero. The temperature
corresponding to the maximum of Cy is easy to be 
exploited numerically. ItisT TOOX = 2AbN v /N t +lM3 
and it is linear with respect to N v /Nt- The maximum 




Figure 3: (Color online) The "specific heat" $C_V$.



value of $C_V$ as a function of the parameter $bN_v/N_t$
is represented in Fig. 4.

We have tested numerically the dependence of the position and the
height of the maximum of the specific heat on the length of the text,
using several lengths in order to detect any size effects, and we
have found that the behavior is independent of the size [20].

Using a similar approach for images in the thermodynamic limit [10],
i.e. when the size of the blocks goes to infinity, one expects a
divergence of the specific heat [21]. This is due to the fact that in
images one can find homogeneous statistics for different resolutions
and image sizes, and both can go to infinity. However, similar
behavior is not observable in the case of texts, because a single
text that explains a given concept has a rather limited size and a
finite "resolution", and cannot be extended. That is why in our model
for the statistical mechanics of written texts, considering the words
as independent, one only observes smeared behavior of the specific
heat.

Figure 4: The maximal values of $C_V$ as a function of $bN_v/N_t$.
The upper panel is a zoomed version of the left one.

5 Numerical experiments 

To check the above results experimentally on real texts, we used
several corpora of texts. First, we used the BNC corpus, as a
standard and balanced corpus of English texts with some $10^8$ words.
Second, we used a collection of about 19000 English texts of the
Gutenberg collection (GC) with size $5 \cdot 10^7$ words. To check
specific domains we used single articles, as well as a collection of
500 articles from the non-linear physics archive (NL) offered by the
xxx.archiv.gov repository. In order to avoid problems with the
different versions of the articles, we used only the first version of
each article. Also, we used a list of 257 closed-class words of
English instead of the function words.

For estimating the parameters $a$ and $b$ of the Gamma distribution
of a single word, we used BNC and GC, which give practically the same
results. The parameter $b$ is within the range 0.01-20 with an
average value of 0.25, and the parameter $a$ takes values up to 2.6
for a text length $L = 10000$. Note that the parameters $a$ and $b$
are well defined, with sufficient confidence, only if $p_w L \gg 1$,
where for all practical purposes we can suppose that $5 \gg 1$. Thus,
within the corpus of $10^8$ words, the parameters are well defined
for fewer than 2400 words. For the rest of the words we used a
simplifying assumption, because it is difficult to prove or disprove
reliably a hypothesis with two degrees of freedom ($a$ and $b$) with
fewer than five measurements for their estimation.

The hypothesis we have adopted was that the less frequent words all
have the same value of the parameter $b$. In this way we could pool
all the words that are not frequent enough for estimating that
parameter. The results are very close to the mean value of $b$. The
parameter $a$, being proportional to the length of the text, is not
so critical to estimate (actually we need only $N_v$ and $b$).

We expected domain nonspecific behavior of the 
function and the common words, and domain and 
text specific behavior of the keywords. 

Fig. 5 shows a typical behavior of $C_V$ for keywords (the two upper
curves in the upper panel), for function words (the two lower curves
in the same panel) and for common words (the lower curve in the lower
panel).

As the function words have much higher frequency 
of occurrence, one can expect that they will have pre- 
dominant role in the specific heat. However this is not 
observed. The specific heat for the keywords is much 
higher than the corresponding one for the function 
words. Even smaller specific heat is carried by the 
common words. 

These results can be interpreted as an indication that the most
vulnerable parts of speech are the common words and the most
resistant ones are the domain-specific words (keywords).

Alternatively, one can interpret the temperature factor as a weight
of the combinatorial term that depends only on the text. Thus, it is
not surprising that the language-dependent part (the function words)
shows the maximal $C_V$ at lower temperature (see Eq. 6). On the
contrary, the keywords in the text, which are not so language
dependent, have the maximum of $C_V$ at higher temperatures.

Figure 5: $C_V$ for different words of one and the same text. The
upper two curves of the left panel represent two keywords of a given
text ("topology" and "topological"). The lower curves of the left
panel represent two function words ("the" and "are"). On the lower
panel the curve of "are" is zoomed in order to represent also the
typical common word "important".

Considering all the words and having the parameters $x$ and $b$ for
each of them, we can calculate numerically the free energy $F$, the
entropy $S$ and the specific heat $C_V$ for the whole text. The
result for $C_V$ is shown in Fig. 6.

What is observed experimentally is the lack of well-pronounced maxima
of $C_V$ for the function words, less pronounced maxima for the
common words and well-pronounced maxima for the keywords. The
function words express the structure of the language, i.e. they
represent its grammar. The keywords, on the other hand, are
expressions of the semantic and the pragmatic structure of the text.
If we represent that structure as a semantic graph, similar to [22],
we can suppose that the keywords reflect the structure of that graph
independently of the grammar or the language we choose to express it
as a text.

Figure 6: (Color online) Experimentally measured $C_V$ on a single
text, as a function of $T = 1/\beta$. The part of $C_V$ corresponding
to the common words is too small to be shown on this scale.

According to Fig. 6 we can observe that there is a wide temperature
range between the maximum of $C_V$ corresponding to the function
words and the maxima of the keywords. Within this range we can expect
that the solution will contain few function words but the rest of the
words will be sufficient for the interpretation of the text.

In order to check that, we took the abstract of a given article and
deleted the deposited words with probability $m/N_t$. The result is
shown in the following boxes, where we represent the same text for
different values of the temperature. The struck-through words are
chosen by their probability of deposition. It can be seen that the
method extracts the meaning very well and ignores the language
structure. The extraction is perfect in the last box, where the
temperature is lower. Note that the words are represented only by
their parameters $x$ and $b$. The program has no notion of "function
word", "common word" or "keyword".
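
The deletion experiment can be sketched as follows. The sketch assumes that a deposition probability $m/N_t$ is already available for each word; the dictionary `deposit_prob`, the mask string and the fixed random seed are hypothetical choices made for illustration.

```python
import random

def thin_text(words, deposit_prob, rng=None, mask="[...]"):
    """Replace each word by `mask` with its own deposition probability m/N_t."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Words without an entry (e.g. not in the vocabulary) are never deleted.
    return [mask if rng.random() < deposit_prob.get(w, 0.0) else w
            for w in words]

# Hypothetical probabilities: a function word that always deposits
# and a keyword that never does.
deposit_prob = {"the": 1.0, "topology": 0.0}
thinned = thin_text(["the", "topology", "the", "loops"], deposit_prob)
```

Note that the procedure uses only the per-word probabilities; no list of function words or keywords enters at any point.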



[...] Gardner, we [...] transition [...] parameters [...] a symmetric
[...] with small world [...] mean-field [...]. [...] was found that
the topology dependence [...] be described [...] small [...] of
parameters, namely [...] of existence [...] loops [...] length. [...]
topology, closed [...] parameters was [...] easily [...] be solved.

1. T=100. In this case the temperature is very high,
which makes the text almost unreadable.



[...] Gardner, we calculate the [...] capacity [...] phase transition
[...] parameters [...] a symmetric Hebb [...] small world [...]
mean-field [...]. [...] was found that the topology dependence [...]
be described [...] small [...] of parameters, namely [...] of
existence [...] loops [...] given length. [...] world topology,
closed algebraic [...] equations with [...] parameters was [...]
easily [...] be solved.

2. T=0.167. This version of the text corresponds to values of the
parameters in the region located after the peak of the "specific
heat" $C_V$ corresponding to the keywords.



[...] Following Gardner, calculate the [...] capacity [...] phase
transition [...] parameters a symmetric Hebb [...] small world
topology mean-field approximation. [...] found that the topology
dependence [...] be described [...] small [...] of parameters, namely
[...] probability of existence [...] loops [...] given length. [...]
small world topology, closed algebraic [...] equations with [...]
parameters was [...] easily [...] be solved.

3. T=0.05. Extraction of the text for values of the parameters in the
region located between the peaks of the "specific heat" $C_V$
corresponding to the function words and the keywords.



[...] Following Gardner, [...] calculate the information capacity
[...] phase transition related parameters [...] a symmetric Hebb
network [...] small world topology [...] mean-field approximation.
[...] found that the topology dependence [...] be described [...]
small number of parameters, namely [...] probability of existence
[...] loops [...] given length. [...] small world topology, closed
algebraic [...] equations with [...] parameters [...] easily [...]
solved.

4. T=0.0125. Extraction of the text for values of the parameters in
the region located in the vicinity of the peak of the "specific heat"
$C_V$ corresponding to the function words.



6 Toy Application 




Figure 7: (Color online) We regard the query as a subset of words.
The effective energy gives us a measure of the relevance of the
query. $\mathrm{Prob}(Q) \propto \exp(-E(w_2)/kT - E(w_4)/kT)$

Let us apply the above considerations in the following context: we
consider a text and ask to what extent a certain word is
characteristic of the text.

If the word is a keyword, it is of course characteristic of the text.
A typical information retrieval application tries to use these kinds
of words in order to extract a text relevant to some query. The
probability to have $m$ times some word $w$, relative to the mean
frequency of this word in the language, is
$\propto \exp(-E_{tot}^{(w)}(m, \beta))$ (see Eq. 6). Then, for a set
of different words regarded as independent entities, the probability
to have this set of words in the text would be
$\propto \exp(-\sum_w E_{tot}^{(w)}(m, \beta))$.

If we ask whether some set of words
$Q = \{w_{q1}, w_{q2}, \dots, w_{qm}\}$ is relevant to the text, and
if relevant is understood as much more probable than its average use
in the language, then $Q$ is relevant if the energy of the words
forming $Q$ is high. If some words occur in $Q$ and not in the text,
then the Gibbs multiplier will be zero and these words will be
ignored.

The concept is very easy to implement. Just calculate the effective
energy of the words in a text and store them as pairs
$(w, E^{(w)}(T))$ for several temperatures. Then, using the query,
one can sum the energies of the words (see Fig. 7). According to the
present theory, the quality of the result of the query does not
depend on the length of the text and the query.
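
A toy version of this procedure might look as follows. It assumes that a table of effective energies has already been computed for each text at one fixed temperature; the text names, the words and all numerical values are hypothetical and serve only to show the bookkeeping.

```python
def score_query(query_words, energy_table):
    """Sum the stored effective energies of the query words found in the text."""
    # Query words absent from the text contribute nothing and are ignored.
    return sum(energy_table[w] for w in query_words if w in energy_table)

# Hypothetical per-text tables {word: E(T)} at a single temperature.
texts = {
    "paper_A": {"topology": 3.2, "network": 2.5, "the": 0.1},
    "paper_B": {"horse": 2.9, "knight": 3.1, "the": 0.1},
}
query = ["topology", "network", "quark"]
ranking = sorted(texts, key=lambda name: score_query(query, texts[name]),
                 reverse=True)
```

Texts are ranked by the summed energy of the query words, so a text in which the query words are strongly over-represented relative to the language comes first.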

The query can perform better than our model, which assumes the
independence of the words in the text and considers the query and the
text as sets of words. This model is close to the vector information
retrieval model [6], but it is richer, because the Gamma distribution
has two parameters.

Query performance on a real implementation is 
currently under evaluation using strict IR criteria and 
future results will be published elsewhere. 

7 Discussion and future directions

As has been shown from the above results, the sta- 
tistical mechanics approach permits a relatively easy 
theoretical analysis and a very fast simulation proce- 
dure, which make it promising. 

The method has some advantages in comparison with the usual IR
methods. First of all, the queries correspond to real probability
measures conditioned on the language. There is no empirical moment of
choice. Second, it is relatively easy to introduce an interaction
between the words, i.e. to introduce conditional probabilities that
go beyond simple bigram models [23]. Because the stable bigrams are
much more frequent in one and the same text than throughout the
corpus, it is logical to suppose that the interactions are weak. If
one introduces them as a perturbation of the energy, the resulting
model can be very resistant to errors and at the same time can
respect the language structure.

As a further step, we can consider different
modifications of the model proposed in this article.
For example, the potential energy, which is derived
experimentally and corresponds to the frequencies of
the words in texts of fixed length, can be replaced by
other functions that probe different characteristics
of the text.

In this paper we use words as a convenient starting
point. However, the approach is not limited to words.
Another interesting choice is the use of maximal
common prefixes, i.e. the longest strings of the texts
that coincide.

If only non-overlapping strings are allowed and the
text is conditioned on itself, the number of
"molecules" at T = 1 would be the length of the
LZ-compressed file and would therefore resemble its
Kolmogorov complexity [24]. Distances similar to those
used in [25, 26] can easily be calculated by
introducing a chemical potential. The disadvantage of
these distances is that they are not operational with
short texts and keywords. Thus, although they give the
best results in tasks measuring the proximity of
texts, they are difficult to use for information
retrieval purposes. This is due to the extremely
sparse representation needed in order to compress the
text.
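
To make the link between non-overlapping strings and LZ
compression concrete, the sketch below counts the phrases of an
LZ78-style incremental parsing. The choice of LZ78 is an
assumption made here for illustration (the text does not say
which LZ variant is meant), and lz78_phrase_count is a
hypothetical helper name.

```python
def lz78_phrase_count(text):
    """Count the phrases of an LZ78-style incremental parsing.

    The text is cut into consecutive non-overlapping phrases, each
    being the shortest string not yet in the dictionary.
    """
    seen = set()
    phrase = ""
    count = 0
    for ch in text:
        phrase += ch
        if phrase not in seen:
            seen.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        # The text ended inside a phrase that was already known.
        count += 1
    return count


short_text = "statistical mechanics"
long_text = "statistical mechanics of text " * 50
for sample in (short_text, long_text):
    # Phrases per character: close to 1 means almost no compression.
    print(len(sample), lz78_phrase_count(sample) / len(sample))
```

For a short string almost every phrase is new, so the parsing
carries little information about repeated structure, which is
one way to see why such measures are hard to use with short
texts and keywords.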

The fact that distances of this type can be regarded
as an extreme case gives us grounds to expect that the
behavior of the system would be richer within the
finite temperature range.

Using overlapping strings and conditioning not only
between two texts, but also on the language, the
knowledge area, the author and similar
characteristics, can give a much denser representation
and could lead to very interesting information
retrieval applications.

8 Related work 

The problem of keyword detection starts with the
seminal work by H. P. Luhn [27], in which he uses
statistical information derived from word frequency
and distribution to compute a relative measure of
significance, first for individual words and then for
sentences, with the purpose of outputting an
"auto-abstract" of the article in question.

In recent years, this field has been addressed with
statistical mechanics tools. The contributions
[28, 29, 30] focus on statistical information about
the spatial use of words in human-written texts, as
opposed to the spatial distribution of words in
randomly shuffled texts. They argue that the standard
deviation of the distances between successive
occurrences of a word is an excellent parameter to
quantify its relevance for the text, i.e. keywords
tend to form clusters, while function words are
essentially uniformly spread in a text.
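
As a rough illustration of this kind of clustering measure, the
sketch below computes the standard deviation of the gaps between
successive occurrences of a word, divided by the mean gap. The
normalization by the mean gap is an assumption made here; the
cited works may define the parameter differently, and gap_sigma
is a hypothetical helper name.

```python
from statistics import mean, pstdev


def gap_sigma(tokens, word):
    """Normalized spread of the gaps between occurrences of `word`.

    Returns None when the word occurs too rarely to define a spread.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return pstdev(gaps) / mean(gaps)


# Toy token lists: "x" evenly spread versus "x" bunched together.
uniform = ["x", "a", "x", "a", "x", "a", "x"]
clustered = ["x", "x", "x"] + ["a"] * 17 + ["x"]
print(gap_sigma(uniform, "x"), gap_sigma(clustered, "x"))
```

Evenly spread occurrences give a value near zero, while bunched
occurrences give a larger value, which matches the qualitative
distinction between function words and keywords drawn above.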

A work of a different nature, but also statistical,
has recently shown that, when words are considered as
networks of interacting letters, maximum entropy
models that are consistent with the pairwise
correlations among letters provide a very good
approximation to the full statistics of four-letter
words [9].

Using an approach similar to the one presented in this
article, but in the context of image analysis, a
statistical mechanics formulation has been defined for
the distribution of small image patches [10]. By
assuming a Boltzmann distribution of the patches, the
authors derived the entropy and the heat capacity and
showed that the heat capacity diverges in the vicinity
of the critical temperature.

9 Conclusion 

In the present article we propose a statistical physics
approach to the analysis of human-written text. By
introducing the concept of an energy of interaction
between the text and the corpus (the language), and
taking into consideration the realistic distribution of
the words inside a given large text corpus, we are
able to derive the thermodynamic parameters of the
system in closed analytical form.

The behavior of the specific heat of the system is
different for different kinds of words (keywords,
function words and common words). It is universal and
independent of the selected text. We also show that
the temperature ranges in which the maxima for these
types of words occur are different.

Finally, we discuss a possible application of the
method to the construction of queries on a text
database. Thus, without prior knowledge of the text,
we can assess the structure and the function of its
different parts, which could be useful for several
information retrieval applications.